dataset metadata
Masader Plus: A New Interface for Exploring +500 Arabic NLP Datasets
Altaher, Yousef, Fadel, Ali, Alotaibi, Mazen, Alyazidi, Mazen, Al-Mutairi, Mishari, Aldhbuiub, Mutlaq, Mosaibah, Abdulrahman, Rezk, Abdelrahman, Alhendi, Abdulrazzaq, Shal, Mazen Abo, Alghamdi, Emad A., Alshaibani, Maged S., Zakraoui, Jezia, Mohammed, Wafaa, Gaanoun, Kamel, Elmadani, Khalid N., Ghaleb, Mustafa, Tazi, Nouamane, Alharbi, Raed, Masoud, Maraim, Alyafeai, Zaid
Masader (Alyafeai et al., 2021) created a metadata structure to be used for cataloguing Arabic NLP datasets. However, developing an easy way to explore such a catalogue is a challenging task. In order to give the optimal experience for users and researchers exploring the catalogue, several design and user experience challenges must be resolved. Furthermore, user interactions with the website may provide an easy approach to improve the catalogue. In this paper, we introduce Masader Plus, a web interface for users to browse Masader. We demonstrate data exploration, filtration, and a simple API that allows users to examine datasets from the backend. Masader Plus can be explored using this link https://arbml.github.io/masader. A video recording explaining the interface can be found here https://www.youtube.com/watch?v=SEtdlSeqchk.
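The kind of filtration described above can be illustrated with a minimal sketch over an in-memory catalogue. This is not the Masader Plus API: the `DatasetEntry` fields, the `filter_entries` helper, and the toy records are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetEntry:
    # Hypothetical fields; the real Masader metadata schema may differ.
    name: str
    dialect: str
    tasks: tuple
    year: int


def filter_entries(entries, dialect=None, task=None, min_year=None):
    """Return the entries that satisfy every criterion that is not None."""
    result = []
    for entry in entries:
        if dialect is not None and entry.dialect != dialect:
            continue
        if task is not None and task not in entry.tasks:
            continue
        if min_year is not None and entry.year < min_year:
            continue
        result.append(entry)
    return result


# Toy records, invented for this example (not real catalogue entries).
CATALOGUE = [
    DatasetEntry("toy-sentiment", "Egyptian", ("sentiment analysis",), 2019),
    DatasetEntry("toy-news", "Modern Standard Arabic", ("topic classification",), 2016),
    DatasetEntry("toy-dialogue", "Egyptian", ("dialogue", "sentiment analysis"), 2021),
]

recent_egyptian = filter_entries(CATALOGUE, dialect="Egyptian", min_year=2020)
print([entry.name for entry in recent_egyptian])
```

Criteria left as `None` are ignored, so the same helper covers single-attribute and combined filters.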
AutoML using Metadata Language Embeddings
Drori, Iddo, Liu, Lu, Nian, Yi, Koorathota, Sharath C., Li, Jie S., Moretti, Antonio Khalil, Freire, Juliana, Udell, Madeleine
As a human choosing a supervised learning algorithm, it is natural to begin by reading a text description of the dataset and documentation for the algorithms you might use. We demonstrate that the same idea improves the performance of automated machine learning methods. We use language embeddings from modern NLP to improve state-of-the-art AutoML systems by augmenting their recommendations with vector embeddings of datasets and of algorithms. We use these embeddings in a neural architecture to learn the distance between best-performing pipelines. The resulting (meta-)AutoML framework improves on the performance of existing AutoML frameworks. Our zero-shot AutoML system using dataset metadata embeddings provides good solutions instantaneously, running in under one second of computation. Performance is competitive with AutoML systems OBOE, AutoSklearn, AlphaD3M, and TPOT when each framework is allocated a minute of computation. We make our data, models, and code publicly available.
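The zero-shot idea in this abstract, recommending a pipeline from the text description of a dataset, can be sketched as a nearest-neighbour lookup over description embeddings. This is a simplified stand-in, not the authors' neural architecture: the bag-of-words `embed` function replaces a learned language embedding, and the descriptions and pipeline labels in `KNOWN` are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding: counts of lowercase whitespace tokens."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical (description, best-known pipeline) pairs, invented for illustration.
KNOWN = [
    ("images of handwritten digits for classification",
     "convolutional network"),
    ("tabular records of customer churn for binary classification",
     "gradient boosted trees"),
    ("daily temperature time series for regression forecasting",
     "ridge regression on lag features"),
]


def recommend(description):
    """Return the pipeline of the known dataset with the most similar description."""
    query = embed(description)
    _, pipeline = max(KNOWN, key=lambda pair: cosine(query, embed(pair[0])))
    return pipeline


print(recommend("tabular customer records for churn classification"))
```

No model is trained or evaluated on the new dataset at recommendation time, which is why a lookup of this kind can return an answer almost immediately.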
Beyond research data infrastructures: exploiting artificial & crowd i…
Web pages indexed by Google (plus a gazillion temporal snapshots). Embedded markup (RDFa, Microdata, Microformats) for annotation of Web pages: supports Web search & interpretation; pushed by Google, Yahoo, Bing et al. (schema.org). Factual errors, annotation errors (see also [Meusel et al., ESWC2015]); ambiguity & coreferences.
KnowMore: data fusion on markup (02/10/19). 1.) Relevance: supervised coreference resolution. 2.) Quality & redundancy: data fusion through supervised fact classification (SVM, kNN, RF, LR, NB), diverse feature set (authority, relevance, etc.), considering source- (e.g. PageRank), entity-, & fact-level.
Rich Context & Coleridge Initiative: building (yet another) KG of scholarly resources & datasets (Stefan Dietze). Context/corpus: publications (currently: social sciences, SAGE Publishing). Tasks: I. Extraction/disambiguation of dataset mentions II.